# Pre-trained Model

Magma
Magma, developed by Microsoft Research, is a multimodal foundation model designed to enable complex task planning and execution by combining vision, language, and action. Pre-trained on large-scale vision-language data, it offers language understanding, spatial intelligence, and action planning, allowing it to excel at tasks such as UI navigation and robot manipulation. It provides a powerful foundation for multimodal AI agent tasks, with broad application prospects.
AI Agent
57.1K

OpenEMMA
OpenEMMA is an open-source project that reproduces Waymo's EMMA model, providing an end-to-end framework for autonomous-vehicle motion planning. It uses pre-trained Vision-Language Models (VLMs) such as GPT-4 and LLaVA to integrate text and forward-facing camera inputs, accurately predicting future waypoints and providing decision rationales. OpenEMMA aims to give researchers and developers accessible tools to advance autonomous-driving research and applications.
Model Training and Deployment
51.1K

InternVL2.5-26B
InternVL2_5-26B is an advanced multimodal large language model (MLLM) built on InternVL 2.0, further enhanced through improved training and testing strategies and higher data quality. It retains its predecessor's 'ViT-MLP-LLM' core architecture while integrating a newly pre-trained InternViT with various pre-trained large language models (LLMs) such as InternLM 2.5 and Qwen 2.5, connected by randomly initialized MLP projectors. The InternVL 2.5 series delivers exceptional performance on multimodal tasks, particularly in visual perception.
AI Model
54.1K

Meta Llama 3.3
Meta Llama 3.3 is a state-of-the-art multilingual pre-trained large language model (LLM) with 70 billion parameters, optimized for multilingual dialogue use cases. It outperforms many existing open-source and proprietary chat models on common industry benchmarks. The model uses an optimized Transformer architecture, with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align it with human preferences for helpfulness and safety.
Chatbot
45.3K

ViTLP
ViTLP is a visually guided generative text-layout pre-trained model designed to improve the efficiency and accuracy of intelligent document processing. It combines OCR text localization and recognition, enabling fast and accurate text detection and recognition on document images. The pre-trained ViTLP-medium version (380M parameters) offers a balanced solution under constraints of computational resources and pre-training dataset scale, preserving performance while optimizing inference speed and memory usage. ViTLP's inference speed typically ranges from 5 to 10 seconds per page on an NVIDIA RTX 4090, which is competitive with most OCR engines.
Document
58.5K

Qwen2.5 Coder 3B
Qwen2.5-Coder-3B is a large language model in the Qwen2.5-Coder series, focused on code generation, reasoning, and debugging. Built on the robust Qwen2.5 architecture, it scales training to 5.5 trillion tokens spanning source code, text-code grounding data, synthetic data, and more. The flagship Qwen2.5-Coder-32B is currently the most advanced open-source code language model, matching the coding capabilities of GPT-4o. Qwen2.5-Coder-3B, in turn, provides a solid foundation for real-world applications such as code assistants, strengthening coding ability while retaining strengths in mathematics and general comprehension.
Coding Assistant
63.8K

DTLR
DTLR is a detection-based handwritten text line recognition model, improved from DINO-DETR, designed for text recognition and character detection. The model is pre-trained on synthetic data and then fine-tuned on real datasets. It holds significant relevance in the OCR (Optical Character Recognition) field, especially in enhancing the accuracy and efficiency of handwritten text processing.
AI Model
56.0K

OLMoE
OLMoE is a fully open, state-of-the-art mixture-of-experts model with 1.3 billion active parameters and 6.9 billion total parameters. All data, code, and logs associated with the model have been released, together with a comprehensive overview of resources for the paper 'OLMoE: Open Mixture-of-Experts Language Models'. The model has significant applications in pre-training, fine-tuning, adaptation, and evaluation, marking a milestone in natural language processing.
AI Model
44.2K

OpenCity
OpenCity is an open-source spatiotemporal foundation model focused on the field of traffic prediction. The model effectively captures and standardizes complex spatiotemporal dependencies in traffic data by integrating the Transformer architecture and graph neural networks, enabling zero-shot generalization across different urban environments. It is pre-trained on large-scale, heterogeneous traffic datasets to learn rich, generalizable representations that can be seamlessly applied to various traffic prediction scenarios.
AI Model
46.9K

EXAONE 3.0 7.8B Instruct
EXAONE-3.0-7.8B-Instruct is a bilingual (English and Korean) pre-trained generative model developed by LG AI Research, featuring 7.8 billion parameters. It was pre-trained on a curated dataset of 8 trillion tokens and has undergone supervised fine-tuning and direct preference optimization, demonstrating highly competitive benchmark performance against similarly sized open models.
AI Model
47.5K

Meta Llama 3.1 405B
Meta Llama 3.1 is a series of multilingual pre-trained large language models developed by Meta, available in 8B, 70B, and 405B sizes. The models use an optimized Transformer architecture, tuned with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety. Llama 3.1 supports several languages, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. It excels at a variety of natural language generation tasks and outperforms many existing open-source and closed chat models on industry benchmarks.
AI Model
98.5K

Index 1.9B Pure
Index-1.9B-Pure is a lightweight version of the Index series, designed for text generation. It was pre-trained on 2.8T tokens of Chinese and English data and outperforms comparable models on several benchmarks. This version deliberately filters all instruction-related data out of the corpus to verify the influence of instructions on benchmark scores, making it suitable for fields requiring high-quality text generation.
AI Content Generation
52.4K

Index 1.9B Chat
Index-1.9B-Chat is a 1.9B-parameter dialogue generation model. It uses SFT and DPO alignment, combined with RAG, to enable few-shot role-playing customization, making its dialogue highly engaging and customizable. The model was pre-trained on a 2.8T corpus of predominantly English and Chinese data and demonstrates leading performance on multiple benchmark datasets.
AI Conversational Agents
55.5K

YAYI UIE Information Extraction Large Model
The YAYI-UIE (Yayi Information Extraction) large model, developed by the Zhongke Wenge algorithm team, is instruction fine-tuned on a manually constructed, high-quality information extraction dataset of over a million samples. It trains information extraction tasks in a unified way, including Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE), covering structured extraction across scenarios such as general, security, finance, biomedicine, healthcare, and business. The model is open-sourced to promote the Chinese pre-trained large model open-source community and to build an ecosystem for the Yayi large model through collaborative open-sourcing.
AI Model
69.8K

Qwen2
Qwen2 is a series of pre-trained and instruction-tuned models supporting 27 languages in addition to English and Chinese. The models excel across multiple benchmarks, with particularly strong improvements in coding and mathematics. Qwen2 supports context lengths of up to 128K tokens, making it well suited to long-text tasks. Moreover, the Qwen2-72B-Instruct model exhibits safety performance comparable to GPT-4, significantly outperforming Mixtral-8x22B.
AI Model
157.9K

GLM 4V 9B
GLM-4V-9B is a new-generation pre-trained model released by Zhipu AI. It supports multi-turn dialogue in Chinese and English as well as visual understanding at a high resolution of 1120×1120. In multimodal evaluations, GLM-4V-9B outperforms GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.
AI Model
77.8K

GLM 4 9B Chat 1M
GLM-4-9B-Chat-1M is a new-generation pre-trained model released by Zhipu AI, an open-source version of the GLM-4 series. It performs strongly on benchmark datasets covering semantics, mathematics, reasoning, code, and knowledge. Beyond multi-turn dialogue, it offers advanced capabilities such as web browsing, code execution, custom tool invocation, and long-text reasoning. It supports 26 languages, including Japanese, Korean, and German, and a special 1M-context-length version is available, suited to developers and researchers handling large volumes of data in multilingual environments.
AI Model
75.6K

GLM 4 Series
GLM-4 Series is the next-generation pre-trained model launched by Zhipu AI, including GLM-4-9B, GLM-4-9B-Chat, GLM-4-9B-Chat-1M, and GLM-4V-9B. These models excel in semantic understanding, mathematical reasoning, code execution, and support up to 26 languages. They also possess advanced functionalities like web browsing and code execution. The GLM-4V-9B model boasts high-resolution visual understanding capabilities, making it suitable for multimodal applications.
AI Model
59.9K

Mixtral 8x22B
Mixtral-8x22B is a pre-trained generative sparse mixture-of-experts language model developed by the Mistral AI team to advance the open development of artificial intelligence. With 141B parameters, it supports optimized deployment methods such as half-precision and quantization to suit different hardware and application scenarios (a loading sketch follows below). Mixtral-8x22B can be used for natural language processing tasks such as text generation, question answering, and translation.
AI Model
80.9K
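As an illustration of the half-precision and quantized deployment paths mentioned above, here is a hedged loading sketch. It assumes the mistralai/Mixtral-8x22B-v0.1 checkpoint on the Hugging Face Hub, a multi-GPU host, and the bitsandbytes package for the 4-bit variant; it is not an official deployment recipe.

```python
# Sketch: two common ways to fit a large MoE model into memory, assuming
# the mistralai/Mixtral-8x22B-v0.1 Hub checkpoint (verify before use).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x22B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Half precision: fp16 weights, sharded across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Quantized alternative: 4-bit NF4 weights to cut memory further.
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
#     device_map="auto",
# )

inputs = tokenizer("Mixtral is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```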

HPT
HPT (Hyper-Pretrained Transformers) is a novel multi-modal large language model framework introduced by the HyperGAI research team. It enables the efficient and scalable training of large multi-modal foundation models, capable of understanding various input modalities including text, images, and videos. The HPT framework can be trained from scratch or efficiently fine-tuned using existing pre-trained vision encoders and/or large language models.
AI Model
69.8K

Chronos
Chronos is a series of pre-trained time-series forecasting models based on language model architectures. A time series is converted into a sequence of tokens through scaling and quantization, and a language model is then trained on those tokens with cross-entropy loss. After training, probabilistic forecasts are obtained by sampling multiple future trajectories given the historical context (a minimal tokenization sketch follows below). The Chronos models were trained on a large amount of publicly available time-series data as well as synthetic data generated using Gaussian processes.
AI Model
75.6K
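To make the scaling-and-quantization step concrete, here is a minimal sketch in the spirit of the description above. The bin count, value range, and mean-scaling rule are illustrative assumptions, not Chronos's published configuration.

```python
# Minimal sketch of Chronos-style tokenization: mean-scale a series, then
# quantize the scaled values into a fixed vocabulary of bins. Bin count and
# clipping range below are assumptions chosen for illustration.
import numpy as np

N_BINS, LOW, HIGH = 4096, -15.0, 15.0  # assumed, not the model's settings

def tokenize(series: np.ndarray):
    scale = np.abs(series).mean() or 1.0          # mean scaling
    scaled = np.clip(series / scale, LOW, HIGH)
    edges = np.linspace(LOW, HIGH, N_BINS - 1)    # bin edges
    return np.digitize(scaled, edges), scale      # token ids in [0, N_BINS)

def detokenize(tokens: np.ndarray, scale: float):
    centers = np.linspace(LOW, HIGH, N_BINS)      # representative value per bin
    return centers[tokens] * scale

history = np.array([10.0, 12.0, 13.0, 11.0, 14.0])
tokens, scale = tokenize(history)
print(tokens, detokenize(tokens, scale))          # round-trips approximately
```

A language model trained over these token ids can then sample many future token sequences, which detokenize back into value trajectories, yielding the probabilistic forecasts the description mentions.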

Gemma 7B
Gemma-7B is a large-scale pre-trained language model developed by Google with 7 billion parameters, designed to deliver robust natural language processing capabilities. It can understand and generate text, supports multiple languages, and suits a variety of application scenarios.
AI Model
174.4K

Gemma 2B
Gemma-2b is part of an open-source pre-trained language model series released by Google, offering various variants of different scales. It is capable of generating high-quality text and is widely applied in domains such as question answering, summarization, and reasoning. Compared to other similar models, Gemma-2b has a smaller model size, which allows for deployment in various hardware environments. The Gemma series aims for safe and efficient artificial intelligence technology, making cutting-edge language model technologies more accessible to researchers and developers.
AI Model
162.8K

TinyLlama
The TinyLlama project aims to pre-train a 1.1B-parameter Llama model on 3 trillion tokens. With some optimizations, this can be achieved in just 90 days using 16 A100-40G GPUs. Training began on 2023-09-01. TinyLlama adopts the same architecture and tokenizer as Llama 2, so it can be used in many open-source projects built on top of Llama (a loading sketch follows below). With only 1.1B parameters, its compactness suits applications with limited computational and memory resources.
AI Model
65.7K
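Because TinyLlama reuses Llama 2's architecture and tokenizer, it loads through the standard transformers Auto classes like any Llama checkpoint. A minimal sketch, assuming the TinyLlama/TinyLlama-1.1B-Chat-v1.0 checkpoint name on the Hugging Face Hub:

```python
# Hedged usage sketch: load TinyLlama like any Llama-family checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```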

Google T5
Google T5 is a unified text-to-text Transformer that, through pre-training on a massive text corpus, achieves state-of-the-art results on multiple NLP tasks. It provides code for loading, preprocessing, mixing, and evaluating datasets, and can be used to fine-tune the published pre-trained models (a usage sketch follows below).
AI Model
47.5K
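A brief sketch of the text-to-text interface, using the t5-small checkpoint via Hugging Face transformers as a stand-in for the original codebase; every task is phrased as input text to output text, selected by a task prefix.

```python
# Sketch of T5's text-to-text interface: the task prefix selects the behavior,
# and translation, summarization, etc. all share the same generate() call.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(
    "translate English to German: The house is wonderful.", return_tensors="pt"
)
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```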

Cargoship
Cargoship is a collection of AI models that provides pre-trained models and an easy-to-use API, allowing you to integrate AI into your software without requiring machine learning expertise. Cargoship's models cover areas such as text processing, text generation, image recognition, image generation, and audio transcription, among others. Users can choose the models they need. The Cargoship model collection is constantly growing and keeps pace with advancements in the AI field. Users can choose to self-host models or obtain personal API keys.
Development Platform
48.9K

Google Vision Transformer
Google Vision Transformer is an image recognition model built on the Transformer encoder. It is pre-trained on the large-scale ImageNet-21k dataset and fine-tuned on ImageNet, giving it strong image feature extraction capabilities for tasks such as image classification. The model splits an image into fixed-size patches, linearly embeds them, and adds position embeddings to the sequence before feeding it to the Transformer encoder. Users can perform image classification and other tasks by adding a linear layer on top of the pre-trained encoder (a usage sketch follows below). Its strengths are powerful image feature learning and broad applicability, and the model is freely available for use.
AI image detection and recognition
61.0K
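A hedged sketch of the pipeline just described (patchify, embed, encode, classify with a linear head), assuming the google/vit-base-patch16-224 checkpoint distributed through Hugging Face transformers; the sample image URL is illustrative.

```python
# Sketch: classify an image with a pre-trained ViT encoder plus linear head.
from PIL import Image
import requests
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")  # resize + normalize
logits = model(**inputs).logits                        # patchify/encode inside
print(model.config.id2label[logits.argmax(-1).item()])
```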
# Featured AI Tools

Flow AI
Flow is an AI-driven movie-making tool designed for creators, utilizing Google DeepMind's advanced models to allow users to easily create excellent movie clips, scenes, and stories. The tool provides a seamless creative experience, supporting user-defined assets or generating content within Flow. In terms of pricing, the Google AI Pro and Google AI Ultra plans offer different functionalities suitable for various user needs.
Video Production
43.1K

NoCode
NoCode is a platform that requires no programming experience, allowing users to quickly generate applications by describing their ideas in natural language, aiming to lower development barriers so more people can realize their ideas. The platform provides real-time previews and one-click deployment features, making it very suitable for non-technical users to turn their ideas into reality.
Development Platform
45.5K

ListenHub
ListenHub is a lightweight AI podcast generation tool that supports both Chinese and English. Based on cutting-edge AI technology, it can quickly generate podcast content of interest to users. Its main advantages include natural dialogue and ultra-realistic voice effects, allowing users to enjoy high-quality auditory experiences anytime and anywhere. ListenHub not only improves the speed of content generation but also offers compatibility with mobile devices, making it convenient for users to use in different settings. The product is positioned as an efficient information acquisition tool, suitable for the needs of a wide range of listeners.
AI
43.3K

MiniMax Agent
MiniMax Agent is an intelligent AI companion built on the latest multimodal technology. Its MCP-based multi-agent collaboration lets AI teams efficiently solve complex problems. It provides features such as instant answers, visual analysis, and voice interaction, boosting productivity up to tenfold.
Multimodal technology
44.2K

Tencent Hunyuan Image 2.0
Tencent Hunyuan Image 2.0 is Tencent's latest AI image generation model, with significantly improved generation speed and image quality. Using an ultra-high-compression-ratio codec and a new diffusion architecture, it achieves millisecond-level image generation, avoiding the wait of traditional generation. The model also improves realism and detail representation by combining reinforcement learning algorithms with human aesthetic knowledge, suiting professional users such as designers and creators.
Image Generation
43.6K

OpenMemory MCP
OpenMemory is an open-source personal memory layer that provides private, portable memory management for large language models (LLMs). It ensures users have full control over their data, maintaining its security when building AI applications. This project supports Docker, Python, and Node.js, making it suitable for developers seeking personalized AI experiences. OpenMemory is particularly suited for users who wish to use AI without revealing personal information.
open source
43.6K

FastVLM
FastVLM is an efficient visual encoding model designed specifically for visual language models. It uses the innovative FastViTHD hybrid visual encoder to reduce the time required for encoding high-resolution images and the number of output tokens, resulting in excellent performance in both speed and accuracy. FastVLM is primarily positioned to provide developers with powerful visual language processing capabilities, applicable to various scenarios, particularly performing excellently on mobile devices that require rapid response.
Image Processing
41.7K

LiblibAI
LiblibAI is a leading Chinese AI creative platform offering powerful AI creative tools to help creators bring their imagination to life. The platform provides a vast library of free AI creative models, allowing users to search and utilize these models for image, text, and audio creations. Users can also train their own AI models on the platform. Focused on the diverse needs of creators, LiblibAI is committed to creating inclusive conditions and serving the creative industry, ensuring that everyone can enjoy the joy of creation.
AI Model
6.9M